Recover stuck service operations after transient DB failures#5011
Recover stuck service operations after transient DB failures#5011serdarozerr wants to merge 14 commits into
Conversation
…e operations Introduces a new periodic recovery job that scans permanently failed delayed_jobs and re-enqueues polling for service operations still in progress at the broker. Recovers cases where a transient DB connection error caused the polling job to fail permanently (max_attempts=1) while the broker operation was still running, leaving the service instance stuck in 'in progress' with no active poller.
The previous implementation queried dead delayed_jobs then performed separate lookups per row to find the pollable job, entity, and last operation state. Replace with a single 4-table join across service_instance_operations, service_instances, jobs, and delayed_jobs, filtering all conditions in one query
a3a10fe to
f37a14c
Compare
| @logger ||= Steno.logger('cc.background') | ||
| end | ||
|
|
||
| def batch_size |
There was a problem hiding this comment.
This could be a constant: BATCH_SIZE.
| module Runtime | ||
| class DelayedJobsRecover < VCAP::CloudController::Jobs::CCJob | ||
| def perform | ||
| logger.info('Recover halted delayed jobs') |
There was a problem hiding this comment.
I would prefer: "Recovering stuck delayed jobs"
| join(:delayed_jobs, guid: Sequel[:jobs][:delayed_job_guid]). | ||
| where(Sequel[:service_instance_operations][:state] => 'in progress'). | ||
| where(Sequel[:service_instance_operations][:type] => 'create'). | ||
| where { Sequel[:service_instance_operations][:created_at] > cutoff_time }. |
There was a problem hiding this comment.
You can use something similar to FailedJobsCleanup (only DB time is used):
where(Sequel.lit("#{Sequel[:service_instance_operations][:created_at]} > CURRENT_TIMESTAMP - INTERVAL '?' SECOND", default_maximum_duration_seconds.to_i))
| def max_attempts | ||
| 1 | ||
| end | ||
|
|
There was a problem hiding this comment.
Please also add a job_name_in_configuration method.
| # unwrap the serialized handler and re-enqueue via the reoccurring job's enqueue_next_job method | ||
| inner_job = Jobs::Enqueuer.unwrap_job(delayed.payload_object) | ||
| inner_job.send(:enqueue_next_job, pjob) | ||
| end |
There was a problem hiding this comment.
I think it would be good to add an explicit log for every re-enqueued job.
| end | ||
|
|
||
| def logger | ||
| @logger ||= Steno.logger('cc.background') |
There was a problem hiding this comment.
The logger should be more specific, e.g. cc.background.delayed-jobs-recover.
| PollableJobModel.db.transaction do | ||
| pjob = PollableJobModel.where(guid: pollable_guid, | ||
| delayed_job_guid: delayed.guid). | ||
| for_update.first |
There was a problem hiding this comment.
Maybe add skip_locked - so if two processes try to lock the same row, the first one succeeds and the second one does nothing.
| where(Sequel[:service_instance_operations][:state] => 'in progress'). | ||
| where(Sequel[:service_instance_operations][:type] => 'create'). | ||
| where { Sequel[:service_instance_operations][:created_at] > cutoff_time }. | ||
| where(Sequel[:jobs][:state] => [PollableJobModel::POLLING_STATE, PollableJobModel::FAILED_STATE]). |
There was a problem hiding this comment.
After giving this some additional thoughts, I think that we should not take failed jobs into account here. This is a terminal state that we already exposed to the user - bringing a "dead job back to life" would violate user expectations.
What we might want to have in addition is triggering orphan mitigation for the combination pollable job is failed and service instance operation is in progress.
|
When focusing on failed jobs where the pollable job is still |
| # Test up migration | ||
| expect(db.indexes(:jobs)).not_to include(:jobs_operation_state_index) | ||
| expect { Sequel::Migrator.run(db, migrations_path, target: current_migration_index, allow_missing_migration_files: true) }.not_to raise_error | ||
| expect(db.indexes(:jobs)).to include(:jobs_operation_state_index) |
There was a problem hiding this comment.
This should also work for postgres. Then you could get rid of the if.
| # pjob is FAILED with operation=service_instance.create, delayed_job has failed_at set. | ||
| # Override individual parameters to break a single filter and test exclusion. | ||
| def make_stuck_scenario( | ||
| sio_state: 'in progress', |
There was a problem hiding this comment.
Very Minor: I prefer explicit names like service_instance_state.
| module VCAP::CloudController | ||
| module Jobs | ||
| module Runtime | ||
| class DelayedJobsRecover < VCAP::CloudController::Jobs::CCJob |
There was a problem hiding this comment.
How about calling it DelayedServiceJobsRecover? Something to make clear that this is only for service related jobs.
| # All filter conditions are satisfied: sio is in progress/create/within cutoff, | ||
| # pjob is FAILED with operation=service_instance.create, delayed_job has failed_at set. | ||
| # Override individual parameters to break a single filter and test exclusion. | ||
| def make_stuck_scenario( |
There was a problem hiding this comment.
Minor: maybe prepare_stuck_service_instance fits better?
…failed polling jobs When a CC polling job permanently fails due to a transient error (e.g. a brief DB connection flip), the client is left with no path to a final state: the pollable job shows FAILED while the service instance operation remains stuck 'in progress' indefinitely. Previously this was addressed by reenqueuing the delayed job, but that approach was fragile and incomplete. This cleanup job detects stuck create operations whose polling job has permanently failed (delayed_jobs.failed_at IS NOT NULL) and resolves them by marking the operation and pollable job as failed and triggering OrphanMitigator to deprovision any broker-side resource, giving clients a definitive final state. Extends coverage to service bindings and service keys. Renames the class and file from DelayedJobsRecover to ServiceOperationsCreateInProgressCleanup to reflect the correct scope.
Problem
When CCDB becomes temporarily unreachable during a service broker polling cycle,
the polling job fails permanently even though the broker is still processing the
operation. This leaves the service resource in an inconsistent state:
last_operation.state = 'in progress'(broker still working or already finished)pollable_job.state = FAILED(CC gave up)Solution
Adds a new periodic scheduled job
ServiceOperationsCreateInProgressCleanupthatdetects stuck
createoperations whose polling job has permanently failed andresolves them by marking the operation as failed and triggering orphan mitigation
to deprovision any broker-side resource, giving clients a definitive final state.
Detection chain:
service_instance/binding/key_operations.state = 'in progress'AND
type = 'create'AND
created_atwithin the max async poll window→ JOIN
service_instances/bindings/keys(via foreign key)→ JOIN
pollable_jobs(viaresource_guid) WHEREstate IN (POLLING, FAILED)AND
operation = 'service_instance/bindings/keys.create'→ JOIN
delayed_jobs(viadelayed_job_guid) WHEREfailed_at IS NOT NULLRecovery:
Marks the stuck operation as
failed, sets the pollable job toFAILED, andcalls
OrphanMitigatorto enqueue a broker-side DELETE for the potentiallyorphaned resource. A row-level
FOR UPDATE SKIP LOCKEDguard prevents doublemitigation when multiple CC instances run concurrently.
Scope:
Covers
service_instance.create,service_bindings.create, andservice_keys.create. Delete and update operations are intentionally excluded —retrying a delete on the same resource GUID can cross-match with the old failed
pollable job, making safe recovery impossible without additional guards.
I have reviewed the contributing guide
I have viewed, signed, and submitted the Contributor License Agreement
I have made this pull request to the
mainbranchI have run all the unit tests using
bundle exec rakeI have run CF Acceptance Tests